Add Workflow Registry Metrics and remove executions running counter #16463

Open
wants to merge 14 commits into base: develop
Conversation

Contributor

@patrickhuie19 patrickhuie19 commented Feb 18, 2025

The executions running gauge can be misleading since it is updated on heartbeat, which is async to workflow deletes through the workflow registry.

I noticed we're lacking some metrics in the workflow registry, so I added coverage.
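
To make the added registry metrics concrete, here is a minimal, self-contained sketch using the plain OpenTelemetry Go API; the actual PR wires these through the repo's beholder/monitoring helpers, and the package, type, metric, and method names below are illustrative assumptions, not the PR's exact code. It also approximates the registry-size gauge with an up/down counter bumped at register/unregister time, which is the point of the change: the value tracks registry events directly instead of lagging behind on an async heartbeat.

package workflowmetrics // illustrative package name

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// registryMetrics is a hypothetical holder for the workflow registry instruments discussed in this PR.
type registryMetrics struct {
	registerCounter metric.Int64Counter       // counts register events handled by the workflow registry
	totalWorkflows  metric.Int64UpDownCounter // tracks engines currently held in the engine registry
}

func newRegistryMetrics() (*registryMetrics, error) {
	meter := otel.Meter("workflow_registry")
	reg, err := meter.Int64Counter("workflow_registry_register_events",
		metric.WithDescription("Number of workflow register events handled"))
	if err != nil {
		return nil, err
	}
	total, err := meter.Int64UpDownCounter("workflow_registry_total_workflows",
		metric.WithDescription("Workflow engines currently tracked by the engine registry"))
	if err != nil {
		return nil, err
	}
	return &registryMetrics{registerCounter: reg, totalWorkflows: total}, nil
}

// onRegister is called synchronously from the register path, not from an async heartbeat.
func (m *registryMetrics) onRegister(ctx context.Context) {
	m.registerCounter.Add(ctx, 1)
	m.totalWorkflows.Add(ctx, 1)
}

// onDelete is called synchronously from the delete path, so the total drops as soon as the engine is removed.
func (m *registryMetrics) onDelete(ctx context.Context) {
	m.totalWorkflows.Add(ctx, -1)
}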

@patrickhuie19 patrickhuie19 requested a review from a team as a code owner February 18, 2025 21:49
@patrickhuie19 patrickhuie19 changed the title from "changed engineCleanup from best effort based on service health to def…" to "Engine Syncer Change + Metrics cleanup" on Feb 18, 2025
Contributor

I see you updated files related to core. Please run pnpm changeset in the root directory to add a changeset, and include at least one of the following tags in the changeset text:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

🎖️ No JIRA issue number found in: PR title, commit message, or branch name. Please include the issue ID in one of these.

@@ -1153,7 +1153,6 @@ func (e *Engine) heartbeat(ctx context.Context) {
 			return
 		case <-ticker.C:
 			e.metrics.incrementEngineHeartbeatCounter(ctx)
-			e.metrics.updateTotalWorkflowsGauge(ctx, e.stepUpdatesChMap.len())
Contributor Author

I am removing this metric entirely. Read PR description for why

Contributor

Hmmm... this still feels like it could be useful @patrickhuie19 -- I can imagine scenarios where the number of executions is just increasing, and not going down such that we'll eventually exhaust the queue. In that case this metric would help us diagnose what's going on.

You mentioned that this relates to deletes of the workflow engine -- can you say more about how? I'm not following entirely

Contributor

Note that the name of this metric (total workflows running) doesn't correspond to what it's measuring (the number of executions running in an instance of the workflow engine); maybe that's the source of confusion here?

Contributor Author

It seems to be more confusing than helpful. The name is part of it. The metric is updated on a heartbeat, which is async to engine.Start() and engine.Close().

I can imagine scenarios where the number of executions is just increasing, and not going down such that we'll eventually exhaust the queue

Thanks for the example use case. Would we be able to satisfy this by looking at the completed duration histogram metrics and observing no executions are finishing?

@@ -1266,7 +1265,7 @@ const (
defaultQueueSize = 100000
defaultNewWorkerTimeout = 2 * time.Second
defaultMaxExecutionDuration = 10 * time.Minute
defaultHeartbeatCadence = 5 * time.Minute
Contributor Author

5 minutes is anemic

// tryEngineCleanup attempts to stop the workflow engine for the given workflow ID. Does nothing if the
// workflow engine is not running.
func (h *eventHandler) tryEngineCleanup(wfID string) error {
	if h.engineRegistry.IsRunning(wfID) {
Contributor Author

@patrickhuie19 patrickhuie19 Feb 18, 2025

This had logic based on whether or not the underlying Ready implementation in the embedded services.StateMachine would return an error. The Engine has no subservices, Close() isn't marked as synchronous with Ready in services.StateMachine, and if the Engine wasn't Ready, we would do nothing anyway.

If we want to stay flexible to checking the status of Ready in the future, we could add some type of sync loop to check status and delete when Ready?

Contributor

@patrickhuie19 Are we subtly changing behaviour here? We'll return an error now if we can't retrieve the engine from the registry, whereas previously we would have ignored that.

Contributor Author

Yes. I'll change it so that engineCleanup errors under the same conditions as previously, to keep the scope of this PR smaller. The tradeoff with handling a missing engine instance this gracefully is that it might fly under our radar when engine instances are getting cleared from the registry before they should be.
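
For reference, a rough sketch of how the cleanup path reads with the Ready gate removed and the previous "ignore a missing engine" behaviour kept, as agreed above. Get and Remove are assumed method names here, since the engine registry's full API isn't shown in this diff:

// tryEngineCleanup attempts to stop the workflow engine for the given workflow ID. Does nothing if the
// workflow engine is not running.
func (h *eventHandler) tryEngineCleanup(wfID string) error {
	if h.engineRegistry.IsRunning(wfID) {
		e, ok := h.engineRegistry.Get(wfID) // assumed accessor
		if !ok {
			// Keep the earlier behaviour discussed above: a missing engine is ignored rather than surfaced
			// as an error. The tradeoff is that premature registry evictions can fly under the radar.
			return nil
		}
		if err := e.Close(); err != nil {
			return fmt.Errorf("failed to close workflow engine: %w", err)
		}
		h.engineRegistry.Remove(wfID) // assumed; drops the stopped engine from the registry
	}
	return nil
}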

@patrickhuie19 patrickhuie19 requested a review from Atrax1 February 18, 2025 21:58
Contributor

github-actions bot commented Feb 18, 2025

AER Report: CI Core

aer_workflow , commit , Clean Go Tidy & Generate , Detect Changes , Scheduled Run Frequency , Core Tests (go_core_tests) , Core Tests (go_core_tests_integration) , Core Tests (go_core_ccip_deployment_tests) , GolangCI Lint (.) , Core Tests (go_core_fuzz) , Core Tests (go_core_race_tests) , test-scripts , lint , SonarQube Scan

1. GolangCI Lint job failed due to non-wrapping format verb in fmt.Errorf

Source of Error:
Golang Lint (.)	2025-02-27T21:20:27.9803622Z ##[error]core/services/workflows/delegate.go:50:167: non-wrapping format verb for fmt.Errorf. Use `%w` to format errors (errorlint)
Golang Lint (.)	2025-02-27T21:20:27.9805430Z 		return fmt.Errorf("delegate failed to unregister workflow engine for workflow name: %s id: %s: %v", spec.WorkflowSpec.WorkflowName, spec.WorkflowSpec.WorkflowName, err)
Golang Lint (.)	2025-02-27T21:20:27.9810742Z 		 ^
Golang Lint (.)	2025-02-27T21:20:27.9812508Z ##[error]core/services/workflows/delegate.go:101:149: non-wrapping format verb for fmt.Errorf. Use `%w` to format errors (errorlint)
Golang Lint (.)	2025-02-27T21:20:27.9814103Z 		return nil, fmt.Errorf("delegate failed to register workflow engine for workflow name: %s id: %s: %v", cfg.WorkflowName.String(), cfg.WorkflowID, err)
Golang Lint (.)	2025-02-27T21:20:27.9818405Z 		 ^

Why: The error is caused by the use of %v instead of %w in fmt.Errorf for wrapping errors. The %w verb is required to wrap errors in Go.

Suggested fix: Replace %v with %w in the fmt.Errorf calls to properly wrap the errors. For example, change fmt.Errorf("... %v", err) to fmt.Errorf("... %w", err).
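
For illustration, the before/after of the pattern the linter is flagging, with placeholder variables rather than the repo's actual ones:

// %v flattens the wrapped error into a string, so callers can no longer match it with errors.Is / errors.As.
return fmt.Errorf("delegate failed to unregister workflow engine for workflow name: %s id: %s: %v", name, id, err)

// %w keeps err in the error chain, which is what the errorlint rule wants.
return fmt.Errorf("delegate failed to unregister workflow engine for workflow name: %s id: %s: %w", name, id, err)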

AER Report: Operator UI CI ran successfully ✅

aer_workflow , commit

@patrickhuie19 patrickhuie19 force-pushed the fix/execution-num-metric branch from 86e8af2 to b7c5a58 on February 22, 2025 00:21
@patrickhuie19 patrickhuie19 changed the title from "Engine Syncer Change + Metrics cleanup" to "Add Workflow Registry Metrics to help chase down possible deletion bug" on Feb 22, 2025
Contributor

@bolekk bolekk left a comment

Maybe I'm too tired but I'm really struggling to make sense of the description of this PR.

What I see going on:

  1. You are removing "workflowsRunningGauge" because it is pretty useless. It's counting the current number of executions at a given point in time, updated only with the heartbeat, correct?
  2. You are adding a bunch of new counters to track all events in the Workflow Registry. Those make sense to me, especially "totalWorkflowsGauge". It won't cover workflows added outside the Registry but I guess that's fine.

What I don't understand is how this is going to help debug why metrics are lingering after workflow deletion. You say that other metrics hang around too, not only the executions gauge. Are you able to reproduce it locally? What new signal are you expecting to get once we start running this in prod?

@patrickhuie19 patrickhuie19 changed the title from "Add Workflow Registry Metrics to help chase down possible deletion bug" to "Add Workflow Registry Metrics and remove executions running counter" on Feb 22, 2025
@patrickhuie19
Contributor Author

Thanks, simplified the description. Your points 1 and 2 are correct. There were multiple drivers behind the changes in this PR, and they are worthwhile on their own, outside the context of the time-series-persisting-after-deletion bug we're tracking down.

bolekk previously approved these changes Feb 22, 2025
Contributor

@bolekk bolekk left a comment

Thanks!

metrics.incrementRegisterCounter(ctx)

// intentionally without workflow specific labels
h.metrics.updateTotalWorkflowsGauge(ctx, int64(h.engineRegistry.Size()))
Contributor

Yes, but slightly misleading: this really only measures workflows that have been started by the syncer -- we should add an increment in the workflows/delegate to give an accurate picture.
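
A rough illustration of that suggestion, with hypothetical method and field names since the delegate's internals aren't shown in this diff: wherever the delegate (rather than the syncer) registers or unregisters an engine, it would update the same gauge.

// Hypothetical hook in core/services/workflows/delegate.go; names are assumptions for illustration only.
func (d *Delegate) registerEngine(ctx context.Context, wfID string, e *Engine) error {
	if err := d.engineRegistry.Add(wfID, e); err != nil {
		return fmt.Errorf("delegate failed to register workflow engine for workflow id %s: %w", wfID, err)
	}
	// Keep the registry-wide gauge accurate for engines started outside the syncer path too.
	d.metrics.updateTotalWorkflowsGauge(ctx, int64(d.engineRegistry.Size()))
	return nil
}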

m, err := initMonitoringResources()
if err != nil {
	lggr.Fatalw("Failed to initialize monitoring resources", "err", err)
Contributor

hmm. this will kill the node. that seems unreasonable, we don't know what else is running.

where did this init call use to live?

Contributor

if it really should block the whole node, then panic

if not, then critical

Contributor Author

For this particular error to get invoked would signal a potentially catastrophic regression in the Beholder impl. But agreed, we don't know what else is running.
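
A minimal sketch of the non-fatal option raised above, assuming the logger in scope exposes a Criticalw method alongside Fatalw; whether this should instead panic is exactly the judgment call being discussed:

m, err := initMonitoringResources()
if err != nil {
	// Log at critical severity instead of fatal so the rest of the node keeps running.
	lggr.Criticalw("Failed to initialize monitoring resources", "err", err)
	// Propagate the failure (exact return shape depends on the enclosing constructor) and let the
	// caller decide whether monitoring is a hard requirement.
	return fmt.Errorf("failed to initialize monitoring resources: %w", err)
}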
